Analyzing Neurotransmitter Receptors & Protein Sequences

Ákos Kimpián, Joshua Lembeck, Elvin Kalinowski, Mikel Garcia Amez, Marcel Skumantz

2025-02-12

Introduction

  • @Mikel add information here

Data Set info

Channels

Material & Methods

  • Dirtying + Cleaning
  • EDA
  • PCA
  • Prediction

Dirtying – Cleaning

  1. Split data
    • Combine via .row_id
  2. Insert missing values
    • Random missingness
    • Imputation is therefore sensible
  3. Insert outliers
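
The three dirtying steps and the matching cleaning steps can be sketched roughly as follows. This is a minimal base R illustration, not the project's actual code: the toy table, the column names Length and Weight, the median imputation, and the IQR rule for outliers are all assumptions made for the example. Only the .row_id key comes from the slides.

```r
set.seed(42)

# Toy stand-in for the protein table (column names are assumptions)
proteins <- data.frame(
  Protein_ID = sprintf("P%03d", 1:20),
  Length     = round(runif(20, 300, 1200)),
  Weight     = round(runif(20, 35000, 140000))
)

# 1. Split data: add a row key, then cut the columns into two tables
proteins$.row_id <- seq_len(nrow(proteins))
part_a <- proteins[, c(".row_id", "Protein_ID", "Length")]
part_b <- proteins[, c(".row_id", "Weight")]

# 2. Insert missing values completely at random (2 of 20 Length values)
na_idx <- sample(nrow(part_a), size = 2)
part_a$Length[na_idx] <- NA

# 3. Insert outliers by inflating two Weight values
out_idx <- sample(nrow(part_b), size = 2)
part_b$Weight[out_idx] <- part_b$Weight[out_idx] * 100

# Cleaning: recombine via .row_id
dirty <- merge(part_a, part_b, by = ".row_id")

# Cleaning: median imputation (an assumed, simple choice for random missingness)
clean <- dirty
clean$Length[is.na(clean$Length)] <- median(clean$Length, na.rm = TRUE)

# Cleaning: flag values above Q3 + 1.5 * IQR as outliers (assumed rule)
upper <- quantile(clean$Weight, 0.75) + 1.5 * IQR(clean$Weight)
clean$Weight_outlier <- clean$Weight > upper
```

Because the missing values are inserted at random, a simple imputation such as the median does not bias the column systematically, which is the reason the slides call imputation sensible here.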

EDA

Prediction preprocessing

  • AA composition variables (aa_*) as features
  • Receptor classes using pattern-based annotation of Protein_Name as target
    • Cys-loop receptors
    • Ionotropic glutamate receptors
    • Other ionotropic receptors
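
Pattern-based annotation of Protein_Name could look roughly like the sketch below. The regular expressions and the toy protein names are illustrative assumptions, not the patterns used in the project; real patterns would need to exclude, for example, muscarinic acetylcholine receptors, which are not Cys-loop receptors.

```r
library(dplyr)

# Toy protein names (hypothetical examples)
toy <- tibble(
  Protein_Name = c(
    "Neuronal acetylcholine receptor subunit alpha-7",
    "Gamma-aminobutyric acid receptor subunit beta-2",
    "Glutamate receptor ionotropic, NMDA 1",
    "P2X purinoceptor 4"
  )
)

# Assumed, simplified patterns for each receptor class
cys_loop_pat <- "acetylcholine|aminobutyric|glycine receptor"
iglur_pat    <- "glutamate receptor ionotropic|NMDA|AMPA|kainate"

annotated <- toy |>
  mutate(
    # The first matching pattern wins; everything else falls into "Other"
    Receptor_class = case_when(
      grepl(cys_loop_pat, Protein_Name, ignore.case = TRUE) ~ "Cys-loop",
      grepl(iglur_pat, Protein_Name, ignore.case = TRUE)    ~ "Ionotropic glutamate",
      TRUE                                                  ~ "Other ionotropic"
    ),
    Receptor_class = factor(
      Receptor_class,
      levels = c("Cys-loop", "Ionotropic glutamate", "Other ionotropic")
    )
  )
```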

PCA

  • Examine structure before modelling
  • using tidymodels
  • PC1 <-> PC2 scatter
#...
pca_rec <- recipe(~., data = prediction_df) |>
  update_role(Protein_ID, Receptor_class, 
              new_role = "id") |>
  step_normalize(all_predictors()) |>
  step_pca(all_predictors())
#...
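
To get from the recipe to the PC1 vs. PC2 scatter, the recipe has to be prepped and baked before plotting. The sketch below shows one way to do this on toy data. The toy data frame, the four aa_* columns, and num_comp = 2 are assumptions for illustration; the elided parts of the original code may differ.

```r
library(recipes)
library(ggplot2)

set.seed(1)
# Toy stand-in for prediction_df (values are random, columns are assumptions)
toy_df <- data.frame(
  Protein_ID     = sprintf("P%02d", 1:30),
  Receptor_class = factor(rep(
    c("Cys-loop", "Ionotropic glutamate", "Other ionotropic"), each = 10
  )),
  aa_A = runif(30), aa_G = runif(30), aa_P = runif(30), aa_V = runif(30)
)

pca_rec <- recipe(~ ., data = toy_df) |>
  update_role(Protein_ID, Receptor_class, new_role = "id") |>
  step_normalize(all_predictors()) |>
  step_pca(all_predictors(), num_comp = 2)

# prep() estimates the normalization and the PCA; bake() returns the scores
pca_scores <- pca_rec |>
  prep() |>
  bake(new_data = NULL)

# PC1 vs. PC2 scatter, coloured by receptor class
pca_plot <- ggplot(pca_scores, aes(x = PC1, y = PC2, colour = Receptor_class)) +
  geom_point() +
  labs(x = "PC1", y = "PC2", colour = "Receptor class")
```

Columns with the "id" role are carried through bake() unchanged, which is why Receptor_class is still available for colouring the points.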

Predictive Modeling

  • Stratified 80/20 train–test split to maintain class balance
  • Random Forest classifier with 1000 trees
  • Basic metrics for evaluation and Mean Decrease Gini (MDG) for feature importance
#...
# Preprocessing recipe: keep ID and class out of the predictors, then normalize
pca_rec <- recipe(~., data = prediction_df) |>
  update_role(Protein_ID, Receptor_class,
              new_role = "id") |>
  step_normalize(all_predictors())

rf_spec <- rand_forest(trees = 1000) |>
  set_engine("randomForest") |>
  set_mode("classification")
#...
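
The full modelling pipeline (stratified split, fit, basic metrics, MDG) might be wired together as in the sketch below. It runs on simulated toy data, and several details are assumptions rather than the project's code: the names rf_rec, train_df and test_df, the use of Receptor_class as the recipe outcome, and accuracy as the example metric.

```r
library(tidymodels)

set.seed(123)
# Toy data with three classes and three aa_* features (simulated values)
toy_df <- tibble(
  Protein_ID     = sprintf("P%03d", 1:150),
  Receptor_class = factor(rep(
    c("Cys-loop", "Ionotropic glutamate", "Other ionotropic"), each = 50
  )),
  aa_G = rnorm(150, mean = rep(c(0.06, 0.08, 0.07), each = 50), sd = 0.01),
  aa_P = rnorm(150, mean = rep(c(0.05, 0.04, 0.06), each = 50), sd = 0.01),
  aa_V = rnorm(150, mean = rep(c(0.07, 0.06, 0.05), each = 50), sd = 0.01)
)

# Stratified 80/20 split keeps the class proportions similar in both parts
data_split <- initial_split(toy_df, prop = 0.8, strata = Receptor_class)
train_df   <- training(data_split)
test_df    <- testing(data_split)

# Assumed recipe for modelling: Receptor_class is the outcome here
rf_rec <- recipe(Receptor_class ~ ., data = train_df) |>
  update_role(Protein_ID, new_role = "id") |>
  step_normalize(all_predictors())

rf_spec <- rand_forest(trees = 1000) |>
  set_engine("randomForest") |>
  set_mode("classification")

rf_fit <- workflow() |>
  add_recipe(rf_rec) |>
  add_model(rf_spec) |>
  fit(data = train_df)

# Basic metric on the held-out test set
preds   <- augment(rf_fit, new_data = test_df)
acc_tbl <- accuracy(preds, truth = Receptor_class, estimate = .pred_class)

# Mean Decrease Gini per feature from the underlying randomForest object
mdg <- randomForest::importance(extract_fit_engine(rf_fit))
```

Sorting mdg by its MeanDecreaseGini column gives the feature ranking. Higher values mean the feature contributed more to reducing node impurity across the trees.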

Results

PCA and classification results

Length and weight correlation analysis
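
One simple way to quantify the length and weight relationship is a Pearson correlation plus a linear fit. The sketch below uses simulated data, not the project's results. The column names Length and Weight are assumptions, and the toy weights are generated from the rule of thumb of roughly 110 Da per amino acid residue.

```r
set.seed(7)

# Simulated proteins: weight is about 110 Da per residue plus noise
toy <- data.frame(Length = round(runif(50, 300, 1200)))
toy$Weight <- toy$Length * 110 + rnorm(50, sd = 2000)

# Pearson correlation with confidence interval and p-value
ct <- cor.test(toy$Length, toy$Weight, method = "pearson")
r  <- unname(ct$estimate)

# Linear fit: the slope estimates the average mass per residue in Da
fit   <- lm(Weight ~ Length, data = toy)
slope <- unname(coef(fit)["Length"])
```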

Discussion (Joshua)

Biological Interpretation

  • Valine, glycine, tryptophan, serine, and proline are the most discriminative amino acids
  • Proline and glycine tend to disrupt α-helices and β-sheets
  • Proline is common in loops and turns

Limitations and Future Directions

  • Was the analysis successful?
    • Cys-loop and ionotropic glutamate receptors are predicted very accurately
    • Other receptors are not well predicted because they are underrepresented in the data set
  • What could be explored in more detail?
    • Test the model against a larger variety of background proteins